fix(core): prevent isolate-wide hang from a stalled session.get on Workers (#1274)#1538
Conversation
…rkers (emdash-cms#1274) A request cancelled mid-`session.get("user")` on Cloudflare Workers can leave the session-store read as a promise that never settles, poisoning the isolate so every later session-bearing request hangs (≈0 CPU, multi-minute wall, `canceled`) — reliably reproducible on fresh isolates right after a deploy. The auth middleware now resolves the session user via a helper that: 1. anchors the read with `after()` so a cancelled request still drives it to completion (the promise settles; the isolate is not poisoned), and 2. applies a fail-closed timeout backstop — a still-stalled read degrades to "unauthenticated for this request" instead of hanging.
🦋 Changeset detectedLatest commit: 8dd4180 The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
@emdash-cms/admin
@emdash-cms/auth
@emdash-cms/auth-atproto
@emdash-cms/blocks
@emdash-cms/cloudflare
@emdash-cms/contentful-to-portable-text
emdash
create-emdash
@emdash-cms/gutenberg-to-portable-text
@emdash-cms/plugin-cli
@emdash-cms/plugin-types
@emdash-cms/registry-client
@emdash-cms/registry-lexicons
@emdash-cms/sandbox-workerd
@emdash-cms/x402
@emdash-cms/plugin-ai-moderation
@emdash-cms/plugin-atproto
@emdash-cms/plugin-audit-log
@emdash-cms/plugin-color
@emdash-cms/plugin-embeds
@emdash-cms/plugin-field-kit
@emdash-cms/plugin-forms
@emdash-cms/plugin-webhook-notifier
commit: |
There was a problem hiding this comment.
Approach: sound, but under-applied. Anchoring the session read with after()/waitUntil so a cancelled request still drives it to settlement — preventing the isolate poisoning rather than merely surviving it — plus a fail-closed timeout backstop, is the right pattern and mirrors the existing single-flight/reclaimable-cache + after() usage in emdash-runtime.ts, settings, and secrets. The fail-closed property holds: a stalled or rejecting read resolves to undefined, and every caller treats absence-of-user as unauthenticated, so the helper can only drop privileges, never grant them. The mechanics are correct — read is .catch-wrapped, the after callback handles both settle arms, and clearTimeout runs in finally (no timer leak, no unhandled rejection, no double read).
The headline problem is coverage. This PR hardens the three session.get("user") sites in auth.ts, but two structurally identical bare awaits are left untouched:
packages/core/src/astro/middleware.ts:419— the main middleware, registered beforeauth(inintegration/index.ts, allorder: "pre"in registration order:emdash/middleware→ redirect → setup →emdash/middleware/auth). So on a session-bearing request likeGET /_emdash/adminthis is the first session read, before the hardened auth middleware ever runs. By this PR's own theory, a client disconnect during thatawaitleaves a never-settling promise that hangs the request here and poisons the isolate — the auth-middleware fix never executes. The fix as written doesn't fully prevent the hang it set out to prevent.packages/core/src/astro/routes/api/snapshot.ts:34— this route is inPUBLIC_API_EXACT, so the auth middlewarereturn next()s in theisPublicApiRoutebranch before any session read; this manual read is the endpoint's only session lookup, and its surroundingtry/catchcan't catch a never-settling promise (as the PR itself notes).
resolveSessionUser is currently module-local to auth.ts; extracting it to a shared module and applying it at both remaining sites closes the gap (the main middleware can keep its !hasSessionCookie short-circuit, which is the #733 KV-read-miss optimization).
Tests. None added. The workerd cancellation race itself isn't unit-testable, but the helper's two safety behaviors are: the timeout backstop (pass a session whose get("user") returns new Promise(() => {}) and assert resolveSessionUser resolves to undefined within the timeout instead of hanging) and fail-closed-on-reject. The infrastructure already exists — tests/unit/astro/middleware-prerender.test.ts mocks session.get and asserts on it for the #733 regression. Flagged below.
Checked: after()/clearTimeout/Promise.race mechanics; the fail-closed security property; import path and ordering; changeset (correct emdash/patch); every session.get("user") call site repo-wide (these two are the misses). Static review only — did not run typecheck/tests/lint.
Verdict: comment — the approach is right and the hardened paths are correct, but the same hole remains in the main middleware (the first session read on the request) and the snapshot route, and the testable backstop lacks a test.
Findings
-
[needs fixing]
packages/core/src/astro/middleware.ts:418-419Line 419 still does a bare
await context.session?.get("user")with noafter()anchor and no timeout backstop — exactly the pattern this PR fixes inauth.ts. Crucially, this middleware runs first: inintegration/index.tsit's registered beforeemdash/middleware/auth(bothorder: "pre", in registration order), so for a session-bearing request likeGET /_emdash/adminthis is the first session read, before the hardened auth middleware runs at all.By this PR's own theory (an unanchored read cancelled mid-flight leaves a never-settling promise that hangs the request and poisons the isolate), this read is equally susceptible: a client disconnect during this
awaitstalls it here, the request hangs beforeauth.tsexecutes, and the isolate is poisoned. The fix as written does not fully prevent the hang it set out to prevent.resolveSessionUseris module-local toauth.ts; extract it to a shared module and use it here (keeping the!hasSessionCookieshort-circuit, which is the #733 KV-read-miss optimization):const sessionUser = context.isPrerendered || !hasSessionCookie ? null : await resolveSessionUser(context.session);(Requires importing the extracted helper;
resolveSessionUseralready acceptssession | undefinedand returnsundefinedwhen absent, so passingcontext.sessiondirectly is safe.) -
[needs fixing]
packages/core/src/astro/routes/api/snapshot.ts:34Same unprotected pattern: a bare
await session.get("user"). The surroundingtry/catchcannot help — as this PR notes, a stalled read neither resolves nor rejects, so thecatchnever fires and this request hangs (and can poison the isolate).This route is in
PUBLIC_API_EXACT, so the auth middlewarereturn next()s in theisPublicApiRoutebranch before doing any session read. That makes this manual read the endpoint's only session lookup — the PR'sresolveSessionUsernever runs here. Lower traffic than the main middleware, but the same hole. Use the shared helper once it's extracted:const sessionUser = await resolveSessionUser(session); -
[needs fixing]
packages/core/src/astro/middleware/auth.ts:65AGENTS.md requires TDD for bugs ("A bug without a reproducing test is not fixed"), and this helper adds new, load-bearing behavior with no test. The workerd cancellation race itself isn't unit-testable, but the helper's two safety behaviors are:
- Timeout backstop — pass a session whose
get("user")returns a never-settling promise (new Promise(() => {})) and assertresolveSessionUserresolves toundefinedwithin the timeout instead of hanging the test. - Fail-closed on reject — pass a
getthat rejects and assertundefined.
The infrastructure already exists:
tests/unit/astro/middleware-prerender.test.tsmockssession.getand asserts on it for the #733 regression, so this is straightforward. Extracting/exportingresolveSessionUser(which the two findings above require for reuse inmiddleware.tsandsnapshot.ts) also makes it directly unit-testable in isolation. The live Workers traces are good evidence the fix works in production, but they don't guard the backstop logic against a future regression. - Timeout backstop — pass a session whose
|
Can i get my own Emdashbot 😅 he is great |
Addresses review feedback on emdash-cms#1538 — the guard was under-applied. Two structurally identical bare `await session.get("user")` reads were left unhardened: - middleware.ts: the main middleware runs *before* the auth middleware (both order: "pre"), so on a session-bearing request this is the *first* session read — it could hang and poison the isolate before the auth fix ever runs. - routes/api/snapshot.ts: a PUBLIC_API_EXACT route the auth middleware skips, so its manual read is the only session lookup. Extract the guard to a shared `astro/session-user.ts` (`resolveSessionUser`) and apply it at all three sites (main middleware keeps its `!hasSessionCookie` short-circuit, the emdash-cms#733 KV-read-miss optimization). Add unit tests for the two testable safety behaviors: timeout backstop on a never-settling read, and fail-closed on a rejecting read.
|
Great review — you're right, the guard was under-applied. Pushed a follow-up (8dd4180):
Local checks pass: |
|
btw, this is emdashbot if you do want one! https://github.com/emdash-cms/emdash/tree/main/infra/flue-review |
|
I took #1538 for a proper spin on throwaway, isolated infra to check whether it actually clears the admin hang we keep hitting on Cloudflare Workers — and wanted to share what came out, including one gap and one unrelated bug. Setup
What the cold path looks likeA cold isolate spends real time in Warm, the same header is
The #1538 timeout helps with (1) — it stops the unbounded hang — but on a fully wedged cold isolate ( Sanity checks that line up with that reading:
What actually cleared itWrapping the let runtime;
for (let attempt = 0; attempt < 4; attempt++) {
try {
runtime = await getRuntime(config, initSubTimings);
if (runtime?.db) break;
} catch (err) {
if (attempt === 3) throw err;
}
await new Promise((r) => setTimeout(r, 75 * (attempt + 1)));
}With this plus the #1538 session timeout, Smart Placement on, on a freshly deployed (cold) Worker:
So my takeaway: #1538 is necessary (it kills the unbounded hang) but on Workers it isn't sufficient on its own — the per-request Separate bug:
|
|
Great work. Let's get this one in, then follow up with the |
What does this PR do?
Fixes logged-in pages (admin and public) hanging indefinitely on Cloudflare Workers (#1274).
A request cancelled mid-
session.get("user")(client disconnect / context teardown) can leave the underlying session-store read as a promise that never settles — it neither resolves nor rejects. Awaiting it directly hangs the request, and because the stalled promise is shared at the isolate level, every later session-bearing request on that isolate hangs too — the exact 0-CPU / multi-minute-wall /canceled/ no-child-spans signature reported in the issue. It's reliably reproducible on fresh isolates right after a deploy, and a surroundingtry/catchcan't help because the promise never rejects.handlePluginRouteAuth,handlePublicRouteAuthandhandlePasskeyAuthnow resolve the session user through a sharedresolveSessionUser()helper that:after()so a cancelled request still drives it to completion — the promise settles and the isolate is not poisoned for subsequent requests (prevents the hang rather than merely surviving it); andundefined, and every caller already treats the absence of a session user as unauthenticated (anonymous on public routes,401/redirect on protected ones). It can only ever drop privileges for that one request, never grant them.This mirrors the reclaimable-cache /
after()pattern already used in core for settings/secrets.Closes #1274.
Type of change
Checklist
pnpm typecheckpasses (emdashpackage, workspace deps built)pnpm lintpasses (oxlint --type-aware --deny-warnings)pnpm formathas been run (oxfmt+prettier)emdashpatch)AI-generated code disclosure
Verification — production, Cloudflare Workers, EmDash 0.21.0
Captured with
wrangler tail(data anonymised — paths/timings/outcomes only).Before — logged-in requests wedge:
GET /_emdash/admincanceledGET /_emdash/admin/content/…canceledAfter this change — fresh post-deploy isolate, same conditions, 26/26 admin+API requests
ok:GET /_emdash/adminokGET /_emdash/api/dashboardokGET /_emdash/api/content/…okok0
canceled, 0 hangs.